Intro to Pandas DataFrames

DataFrames are a lovely way to store data. They're essentially matrices that can store almost any type of data and are a great option for handling data where you want to keep track of rows and columns with labels. The Pandas library also comes with a handy host of functions that allow you to work with your DataFrames in very smart ways (eg. means, split-apply-combine).

The Pandas website has a lot of great documentation to help you get started. However, even after reading the official tutorials, I was stumped on some problems using DataFrames. This lesson is based directly off of how I solved those problems, so hopefully some will find it helpful.


Creating DataFrames

You can instantiate DataFrames without any data or with data from any number of sources


In [1]:
import pandas as pd
import numpy as np

blank = pd.DataFrame()

blank


Out[1]:

In [2]:
movies = pd.DataFrame(np.zeros((4,4)), index=['Forrest Gump', 'Scanners', '2010: Odyssey Two', 'Fern Gully'], columns = ['Date Released', 'Box Office Gross', 'IMDB Score', 'Tomatometer'])

movies


Out[2]:
Date Released Box Office Gross IMDB Score Tomatometer
Forrest Gump 0 0 0 0
Scanners 0 0 0 0
2010: Odyssey Two 0 0 0 0
Fern Gully 0 0 0 0

In [3]:
zeros = pd.DataFrame(np.zeros((3,5)))

zeros


Out[3]:
0 1 2 3 4
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0

Got a csv file you want to read? No problem!


In [4]:
probedata = pd.read_csv('12probe20cm.csv')

probedata[1:4]


Out[4]:
Animal Date Group Time to platform Trial duration Distance travelled (cm) Average speed % time near walls Direction Average Proximity Cumulative Proximity Whishaw's corridor Platform Quadrant Zone 1 % Target Zone % Zone 3 % Zone 4 %
1 m2 2/1/2015 GFP 8.235 59.898 1298.611 21.680 2.516 52.129 47.077 1232.864 37.736 4 4.822 50.943 15.514 0.629
2 m3 2/1/2015 GFP 0.000 59.944 1562.190 26.061 2.096 -0.502 36.535 1799.688 54.717 4 5.451 34.382 10.273 3.774
3 m4 2/1/2015 APP 7.102 59.896 1157.948 19.333 0.000 51.561 41.952 1392.255 39.203 4 11.530 37.945 16.352 0.210

Slicing DataFrames

You can slice DataFrames just like an array, and can also include some more advanced criteria


In [5]:
movies['Date Released'] = ['1994','1981','1984','1992']
movies['Box Office Gross'] = [50,12,8,16]
movies['IMDB Score'] = [8.8,6.8,6.8,6.4]
movies['Tomatometer'] = [.72,.80,.66,.71]

movies[1:2]

#OR

movies.iloc[1:2]


Out[5]:
Date Released Box Office Gross IMDB Score Tomatometer
Scanners 1981 12 6.8 0.8

In [6]:
movies[['IMDB Score','Tomatometer']]


Out[6]:
IMDB Score Tomatometer
Forrest Gump 8.8 0.72
Scanners 6.8 0.80
2010: Odyssey Two 6.8 0.66
Fern Gully 6.4 0.71

In [7]:
movies[movies['IMDB Score']<(movies['Tomatometer']*10)]


Out[7]:
Date Released Box Office Gross IMDB Score Tomatometer
Scanners 1981 12 6.8 0.80
Fern Gully 1992 16 6.4 0.71

Adding new elements

Columns and rows can be added easily (relatively)

Columns can be added by setting a new column to another DataFrame. Just make sure that the indices are compatible!


In [8]:
favlist = pd.DataFrame([True,True,True,True],index=movies.index)

movies['Childhood Top 10?'] = favlist

movies


Out[8]:
Date Released Box Office Gross IMDB Score Tomatometer Childhood Top 10?
Forrest Gump 1994 50 8.8 0.72 True
Scanners 1981 12 6.8 0.80 True
2010: Odyssey Two 1984 8 6.8 0.66 True
Fern Gully 1992 16 6.4 0.71 True

Rows can be added by using the append function on a DataFrame, taking an appropriately index and columned DataFrame that will be added on the end of the first DataFrame.


In [9]:
wildwildwest = pd.DataFrame(index=['Wild Wild West'], columns=movies.columns)

wildwildwest.iloc[0] = ['1999', 2, 4.8,.17,False]

movies.append(wildwildwest)


Out[9]:
Date Released Box Office Gross IMDB Score Tomatometer Childhood Top 10?
Forrest Gump 1994 50 8.8 0.72 True
Scanners 1981 12 6.8 0.8 True
2010: Odyssey Two 1984 8 6.8 0.66 True
Fern Gully 1992 16 6.4 0.71 True
Wild Wild West 1999 2 4.8 0.17 False

The concat function is also very useful. It can handle DataFrames with different indices and/or columns. There are multiple ways to joining the indices and columns however you would like


In [11]:
pd.concat([movies,movies])


Out[11]:
Date Released Box Office Gross IMDB Score Tomatometer Childhood Top 10?
Forrest Gump 1994 50 8.8 0.72 True
Scanners 1981 12 6.8 0.80 True
2010: Odyssey Two 1984 8 6.8 0.66 True
Fern Gully 1992 16 6.4 0.71 True
Forrest Gump 1994 50 8.8 0.72 True
Scanners 1981 12 6.8 0.80 True
2010: Odyssey Two 1984 8 6.8 0.66 True
Fern Gully 1992 16 6.4 0.71 True

The Amazing GroupBy

Pandas.GroupBy is a great function that allows you to process your data in many different ways without having to get fancy or write any loops.

Essentially, it carries out three different steps:

Splitting the data into different groups [eg. treatment condition]

Applying some function to your data [eg. mean]

Combining the results back into another DataFrame

In my work, this function was extremely useful when I wanted to obtain the mean time spent in target zone for each of my treatment groups


In [12]:
probedata


Out[12]:
Animal Date Group Time to platform Trial duration Distance travelled (cm) Average speed % time near walls Direction Average Proximity Cumulative Proximity Whishaw's corridor Platform Quadrant Zone 1 % Target Zone % Zone 3 % Zone 4 %
0 m1 2/1/2015 APP 10.724 59.988 1482.678 24.716 6.918 -117.097 45.391 2064.426 46.541 4 2.935 17.820 31.447 8.805
1 m2 2/1/2015 GFP 8.235 59.898 1298.611 21.680 2.516 52.129 47.077 1232.864 37.736 4 4.822 50.943 15.514 0.629
2 m3 2/1/2015 GFP 0.000 59.944 1562.190 26.061 2.096 -0.502 36.535 1799.688 54.717 4 5.451 34.382 10.273 3.774
3 m4 2/1/2015 APP 7.102 59.896 1157.948 19.333 0.000 51.561 41.952 1392.255 39.203 4 11.530 37.945 16.352 0.210
4 m5 2/1/2015 GFP 14.931 59.912 1310.405 21.872 0.000 -21.198 70.274 1550.027 49.476 4 6.709 35.430 18.658 5.451
5 m6 2/1/2015 APP 17.235 59.948 1348.853 22.500 0.419 -131.100 29.144 1747.719 61.845 4 6.080 35.220 14.046 8.386
6 m7 2/1/2015 GFP 5.086 59.897 1402.622 23.417 0.839 30.670 44.798 1832.923 45.073 4 7.128 35.849 18.658 5.031
7 m8 2/1/2015 APP 3.899 59.969 1686.095 28.116 6.289 115.868 49.414 2482.494 62.474 4 3.774 15.094 27.254 13.627

In [13]:
grouped = probedata.groupby(['Group']).mean()

grouped


Out[13]:
Time to platform Trial duration Distance travelled (cm) Average speed % time near walls Direction Average Proximity Cumulative Proximity Whishaw's corridor Platform Quadrant Zone 1 % Target Zone % Zone 3 % Zone 4 %
Group
APP 9.740 59.95025 1418.8935 23.66625 3.40650 -20.19200 41.47525 1921.7235 52.51575 4 6.07975 26.51975 22.27475 7.75700
GFP 7.063 59.91275 1393.4570 23.25750 1.36275 15.27475 49.67100 1603.8755 46.75050 4 6.02750 39.15100 15.77575 3.72125

In [14]:
import matplotlib.pylab as plt
    import seaborn as sns
    
    sns.set_context("talk")
    grouped[['Zone 1 %','Target Zone %','Zone 3 %','Zone 4 %']].plot(kind='bar')
    plt.title('Time spent in Zone')
    plt.ylabel('Time (sec)')
    plt.xlabel('Zone')
    locs, labels = plt.xticks()
    plt.setp(labels,rotation=0)
    plt.show()

In [ ]: